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Abstract. In this article we present the implementation of an environment 
supporting Levy's optimal reduction for the A-calculus ILev78l on parallel (or 
distributed) computing systems. In a similar approach to Lamping's one in 
ILam90l . we base our work on a graph reduction technique known as directed 
virtual reduction IDPR97I which is actually a restriction of Danos-Regnier 
virtual reduction IDR93| , 

The environment, which we refer to as PELCR (Parallel Environment for 
optimal Lambda-Calculus Reduction) relies on a strategy for directed virtual 
reduction, namely half combustion. While developing PELCR we have adopted 
both a message aggregation technique, allowing a reduction of the communi- 
cation overhead, and a fair policy for distributing dynamically originated load 
among processors. Additionally, we have used a set of other optimizations, 
e.g. allowing the maintenance of relatively low size for the manipulated data 
structures so not to incur problems related to their management at the appli- 
cation level or due to the management of large process memory images at the 
operating system level. 

We also present an experimental study demonstrating the ability of PELCR 
to definitely exploit parallelism intrinsic to A-terms while performing the re- 
duction. We show how PELCR allows achieving up to 70/80% of the ideal 
speedup on last generation multiprocessor computing systems. As a last note, 
the software modules have been developed with the C language and using a 
standard interface for message passing, i.e. MPI, thus making PELCR itself a 
highly portable software package. 



1. Introduction 

Jean- Jacques Levy formally characterized the meaning of the word optimal rel- 
atively to a reduction strategy for A-calculus, referring to it as the property that 
the strategy reaches the normal form (if it exists) and does not duplicate the work 
of reducing similar /3-redexes |Lev78) . 
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This characterization was formalized in terms of families of redexes that is, 
redexes with the same origin, possibly this origin being a virtual one in the sense 
that two families coming in a configuration producing a new redex originate a new 
family. Redexes belonging to different families cannot be successfully shared during 
reduction; whereas for two redexes in the same family, one could find an optimal 
strategy (i.e. reducing all of them in a single step). 

Data structures suitable for an implementation of optimal reduction were pre- 
sented a long time later |Lam90| ; the outcome reduction technique introduced by 
J. Lamping, known as sharing reduction, relies on a set of graph rewriting rules. 

In |GAL92| . Lamping's sharing reduction was proved to be a way to compute 
Girard's execution formula, which is an invariant of closed functional evaluation 
obtained from the "Geometry of Interaction" interpretation of A-calculus |Gir89j . 
This result stirred the research in the field of optimal reduction. Specifically, in 
|UR93j a graphical local calculus, namely virtual reduction (VR), was defined as 
a mechanism to perform optimal reduction by computing the Girard's execution 
formula. Such a calculus was later refined in |DPR97j , by the introduction of a new 
graph rewriting technique known as directed virtual reduction (DVR) . The authors 
also defined a strategy to perform DVR, namely combustion, which simplifies the 
calculus and can simulate individual steps of sharing reduction. 

In this article we describe a technique for the implementation of functional cal- 
culi. This technique exploits both locality and asynchrony of the computation which 
is typical in interaction nets f |Laf90| 'l and derives from the fine decomposition of 
the A-calculus /3-rule obtained through the analysis provided by the Geometry of 
Interaction. Specifically, we present the implementation of a Parallel Environment 
for optimal Lambda-Calculus Reduction (PELCR) which relies on DVR and on a 
new strategy to perform DVR that will be referred to as half- combustion (HC). 

Let us stress that any interpreter of an ML-likc functional language based on 
our technique ensures the execution of programs in a parallel (or distributed) envi- 
ronment in a way completely transparent to the user. 

To the best of our knowledge, our work is the first attempt for parallel implemen- 
tations of optimal A-calculus reduction. Actually in |Mac97| issues on the possibility 
of parallel implementations for Lafont's interaction nets are discussed. In that work 
Mackie is faced to problems of load balancing and fine grain parallelism. The so- 
lution proposed in Mac97 is a static analysis of the initial interaction net which 
aims at setting up a favorable initial distribution of the nodes among processors. 
His work is related to optimal reduction since optimal rules (e.g. in |GAL92p define 
an interaction system |Laf90j . However, contrarily to our work, it does not focus on 
optimal reduction. Another fundamental difference between our work and Mackic's 
study is that our approach is dynamic: load distribution is decided at run-time and 
the message passing overhead is controlled dynamically as well. Also, our imple- 
mentation embeds a set of other optimizations further allowing improved run-time 
behavior, e.g. for what concerns memory performance at both the application level 
and the operating system level. 

The implementation has been developed with the C language using a standard 
interface, namely MPI, for supporting message passing functionalities among pro- 
cesses involved in the computation. These peculiarities make PELCR a highly 
portable software package, easy to install on a wide set of, possibly heterogeneous, 
computing platforms. 
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We also report the results of an experimental evaluation of our software package. 
By the experimental data, we show how it has the ability to definitely exploit any 
form of parallelism intrinsic to the reduction of A-tcrms. As a result, we obtain 
up to 70/80% of the ideal speedup, i.e. the ideal acceleration as compared to a 
sequential case, while performing A-term reductions on last generation multipro- 
cessor systems. This, in its turn, also allows decreasing the wall-clock time for the 
reduction from several tens of seconds to few seconds. This points out how PELCR 
has the potential to cope with response time requirements for the satisfaction of an 
interactive end-user even in case of jobs that would require large computation time 
if executed in a classical sequential fashion. 

We analyze the problem (parallel implementation of functional calculi) from a 
pragmatic point of view, and the theory of (directed) virtual reduction is here 
considered mainly for how it can give rise to parallel dynamics. However, while 
recalling such a theory, we also propose a few optimization rules allowing an increase 
in the effectiveness of the DVR approach (see Section |2~31 . which have been taken 
into account while developing the implementation. The remainder of the article is 
structured as follows. In Section [3 we recall DVR. In Section the HC strategy for 
DVR is introduced. In Section 01 we report the description of our implementation. 
The experimental results are reported in Section [S] 

Acknowledgments. The project of a parallel and optimal interpreter for A- 
calculus started as a joint effort between the University of Paris 7 and the "Istituto 
per le Applicazioni del Calcolo" in Rome (see "Optimal and parallel evaluations in 
functional languages" CNR/CNRS - Bilateral Project n.3132 - 1996/97); some of 
the ideas used in our implementation arose thanks to discussions of the first author 
with Vincent Danos. The authors also wish to thank Carlo Giuffrida for his support 
while developing some software modules. 

2. From Lambda-Terms to Directed Virtual Reduction 

As pointed out in the introduction we deal with an evaluator for A-terms based 
on DVR, to be executed on parallel/distributed computing system. The pioneer- 
ing ideas contained in Levy's work on optimal reduction where finally realized by 
Lamping and then related to semantical questions about operational aspects of 
computations. In fact almost at the same time Girard gave the foundations of an 
outstanding mathematical base for the study of operational semantics. 

Just to enumerate them we should cite Lamping's first work on sharing reduction, 
the connection with Geometry of Interaction discovered by Gonthicr and finally the 
work of Danos-Rcgnicr on VR and DVR. There is no way to get a complete and 
self-contained presentation of all this material, therefore, for a complete survey 
about the optimal implementation of functional programming languages we refer 
the reader to |AG98| . 

Here we shortly recall VR and the Geometry of Interaction fSection 12.1(1 : then 
we will give a full presentation of DVR fSection l2.2(l . with the introduction of some 
properties (Section 12. 3|) which will be taken into account in the implementation. 
To ease the comprehension of this reduction technique and to make a more direct 
connection with Lamping's graphs we finally present an encoding of such graphs 
into directed virtual nets fScction |2.4|) . 

The basic ingredient in Gonthier and Danos-Rcgnicr works is the use of the 
invariancc of the execution formula as a consistency criterion for the reduction 
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Figure 1. An explicative picture for execution paths. 

technique. The execution formula associated with a term T with free variables 
{xi, . . . ,x n } is given by the set of its border-to-border weighted straight paths in 
its dynamic graph Rt- Any node pt in the border node is either associated with 
one free variable xi or, if i = 0, po represents the root of the term, see for example 
Figure Q A straight path in a directed graph is a path that never bounces back in 
the same edge. 

The execution formula of Rt is: 

exort) = £ wfaj) 

where W{.) is a morphism from the involutive category of paths V(Rt) to the 
monoid of the Geometry of Interaction, so that for any straight path 0^ from pi to 
Pj, W{4>ij) is an element of that monoid. 

The preliminary step for our work is the Danos and Regnier's construction of a 
confluent, local and asynchronous reduction of A-calculus, derived from a semantic 
setting based on a unique type of move (simple enough to be easily mechanized) . 
Their graph reduction technique, namely VR, can be explained also as an efficient 
way to compute the execution formula. The one and only reduction rule is the 
composition of two edges in the graph as described in Figure |21 Whenever two 
edges of the virtual net are composable (i.e. the product of their weights is non- 
null), VR derives from them a new edge. The original edges are then marked by 
the rest of the composition, here denoted by weight within brackets. 




y*x 



Figure 2. Composition performed by VR. 

The algebraic mechanism corresponding to the rest is called the bar; it was 
introduced in |DR93j to ensure the preservation of Girard's execution formula. 
Note that VR induces bars of bars by definition; this is shown in Figure DVR, 
presented in |DPR97| . was designed in order to avoid bars of bars, thus allowing 
any implementation to use simple data structures for representing edges. 
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Figure 3. VR originating bars of bars. 



2.1. Geometry of Interaction. The basic geometrical construction consists of a 
directed graph with weights in the dynamic algebra. The most important point 
is that the computation of Girard's "Execution Formula" is performed in a way 
that appears to be the natural candidate for parallel computation. In order to get 
a computational device from this graphical calculus, a suitable strategy has been 
introduced: by means of the combustion strategy it was proved that not only the 
mechanism of DVR computes the execution formula but also that it can do it in 
the same way as Lamping's algorithm for sharing graphs. 

The Geometry of Interaction basic step is the introduction of a suitable algebraic 
structure, in view of the modeling of the dynamics of the reduction. This structure 
can be thought as of the set of partial one-to-one maps u with composition. The 
structure is then enriched with partial inverses u*, the codomain operation (u), and 
the complementary of the codomain [u] . Axioms for such a structure are formally 
introduced below. 

Definition 2.1. An inverse monoid (see |Pet84j ). or for short an im, is a monoid 
with an unary function, the star, denoted by (.)*, with 



We denote by (u) the idempotent uu* . With this notation the last equation 
becomes (u)(v) = (v)(u) and the one before becomes (u)u = u. 

Definition 2.2. A bar inverse monoid, or for short a bim, is an im with a zero, 
denoted by 0, and an unary function, the bar, denoted by [.], with 



(1) 
(2) 
(3) 
(4) 




v*u 



uu*u 



vv*uu* . 



uu*vv 



(5) 
(6) 



[1] = and [0] = 1, 
u[v] = [uv]u. 



Bim's axioms entail: 



(1) [u]u = 0, in fact u[l] = [u l]u thus = [u]u; 

(2) [u][u] = [u], in fact [u][u] = [[u]u}[u] = [0][u] = [u]; 



(i 



M. PEDICINI, F. QUAGLIA 



(3) [«]* = [u], [«]*[«] - [«]*[«] [«]*[u] = [«]*[«] [u]*[u]*[u] = [u]*[u][u][u]*[u]*[«] = 
[u]*[u][u]*[u][it][u]* = [u]*[u][u][it]* = = [u]* then we have [u] = 

[«][«]*[«] = HH'HKH = [«][«]*[«]*[«] = M*MMM* = M*MM* = 
[«]*; 

(4) and to = iff v[u] = v. In fact v[u] = [vu]v = [0]v = v, on the other hand 
if v[u] = v then v[u]u = vu but [u]u = thus v[u]u = = vu. 

Now we give the construction of the free bim generated by a given im. So let 
S be an im, and Z[5] denote the free contracted algebra over S with coefficients 
in Z (the ring of integers). In other words Z[S] is the algebra of maps from S to 
Z with finitely many non-zero values. In other words Z[S] is the algebra of linear 
combinations over S with coefficients in Z. 

For any such linear combination, s = Y^ n i s i> define 

(7) s* = ]T n lSl \ [s] = 1 - (s) = 1 - ss\ 

Define the complementary closure of S in Z[S], denoted by [S], as the monoid 
generated in Z[S] by the union of S and {1 — (it), «eS}. 

Proposition 2.3. [S] is an inverse monoid with (.)* defined as in 

Proof. The proof is a straightforward calculation. In fact we have that for every 
element s € [S] can be written as a combination 

(8) S = Si[ui]s 2 ■ ■ ■ [U n ]s n +V 

Let us introduce the length \s\ of an element s £ [S] as the smallest n such that 
©• 

Now we prove that properties (l)-(4) in Definition 12 . II hold for [S]. 

(1) First we prove that for every u,V £ [S] we have (uv)* = v*u*, by double 
induction on the lengths of u and v. 

If |u| = and \v\ = then u,v £ S and the property holds by definition, 
because S is an inverse monoid. 

Let be \u\ = and for every v such that \v\ < n the property holds, then 
we prove that for every v' such that \v'\ = n + 1, (uv')* = v'*u*. 

Let be v' = w[u„ + i]s ra +2 and u = s, then 

(uv')* = (sv[u n+ i]s n+2 )* = 

= (sv(l - (u n+1 ))s n+2 )* = 
= (svs n+2 - sv(u n+1 )s n+2 )* = 
= (svs n+2 )* - (sv(u n+ i)s n+2 )* = 
= s n+2 *v*s* - s n+2 * (u n+1 )v* s* = 
= s n+2 *[u n+ i]v*s* = v'*u*, 

in fact |f's„ +2 | = n and from (u n+ i)s n+2 £ S we have \v' (u n +i) s n+2 \ = n; 
thus by induction hypothesis (svs n+2 )* = s n+2 *v*s* and (sv' (u n+ i) s n+2 )* 
= s n+2 *(u n+ i)v'*s*. Now suppose the property holds for any u such that 
|«| < n and for every v £ [S]. We show that it holds for u' £ [S] such that 



PARALLEL/DISTRIBUTED OPTIMAL REDUCTION 



7 



=71+1. 

(u'v)* = (u[u n+ i]s n+2 v)* = 

= (us n+2 v - UU n+ lU n+ l* s n+2 v)* = 
= (us n+2 v)* - (llU n+1 U n+ l* S n+2 v)* , 

we have |ws„+2| = \u\ = n and \uu n +iu n +i* s n + 2 \ — n by induction hy- 
pothesis, we have (us n+2 v)* — v*s n+2 *u* and 

{uu n+1 u n+ i* s n+2 v)* = V*S n+2 *U n+ lU n+1 *U* , 

thus 

(u'v)* = v*s n+2 *u* - V*S n+2 *U n+1 U n+1 *U* = 
= v*s n+2 *[u n+ i]u* = 
= v*{u[u n+1 ]s n+2 )* = v*u'*. 

(2) Now we prove that u** = u. Again by induction on the length It is 
clear that if \u\ = then u G S and u** = u by definition of inverse monoid. 

Suppose that the for every u G [S] such that |u| <nwc have u** = u, and 
consider v! such that \v!\ = n+ 1. We may write u' = u[u n +i]s n + 2 , then we 
have u'* = (u[u n+ i]s n + 2 )* by the previous proof we have (u[u n +i]s n + 2 )* = 
s n+2 *[u n+ i]u* and 

u'** ={s n+2 *[u n+1 ]u*)* = 
=u**[u n+1 ]s n+2 ** = 

=u[u n +l]s n+ 2, 

since by induction hypothesis u** = u. 

(3) Let us prove that for every u G [S] we have uu*u = u. Case |u| = 0, implies 
u G S and follows from the definition of S. 

Suppose, uu*u = u for every u G [S] such that \u\ < n. Consider 
yf = u[u n+ i\s n+2 such that \u\ = n and \u'\ = n + 1. Then 

u'u'*u' = u[u n+1 ]s n+2 (u[u n+1 ]s n+2 )*u[u n+1 }s n+2 = 

= u[u n +l]Sn+2Sn+2 i '[u n+ l}u*u[u n+ i}s n+2 = 

= u(l - (u n+1 ))s n+2 s n+2 *(l - (u n +i))u*u(l - (u n+ i))s n+2 = 

= US n+2 S n+2 *U*US n+2 

- u(u n+ i)s n+2 s n+2 *u*us n+2 

- us n+2 s n+2 * {u n+ i)u*us n+2 - us n+2 s n+2 *u*u{u n+ i)s n+2 

+ u{u n +l)s n +2Sn+2 k '{u n +l)u*US n+2 
+ U (u n+ l)s n+2 S n+2 *U*u{u n+ l) S n+2 
+ US n+2 S n+2 * {u n+ l)u*u(u n+ l)s n+ 2 

- u{u n+ l)s n+ 2S n+ 2* {u n+ l)u k u{u n+ l)s n+ 2 = 
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= USn+2 - US n +2S n +2*U*US n+ 2S n+ 2*(Un+l)Sn+2 
)u*US n +2 

- US n+2 S n+2 *U*u(u n+ t}s n+2 

+ u{u n+ l)s n+2 S n+2 * {u n+ l)u*US n+2 
+ u(u n+ l}s n+2 S n+2 *U*u{u n+ -i)s n+ 2 
-\- US n +2 s n+2 

- u{u n+ i)s n+2 S n+ 2* {u n+ i)u*u{u n+ i)s n+ 2 = 
= US n+2 ~ US n+2 S n+ 2* (u n+ i)s n+2 

- US n+ 2S n+ 2* U* US n+ 2S n+ 2* (u n+ i) S n+ 2 
~ US n+2 S n+ 2*U*u(u n+ l)s n+ 2 

+ US n+ 2S n+ 2*U*US n+ 2S n+ 2*(u n+ i)s n+2 

l)(Un+l)s n +2 
+ US n+ 2S n+ 2*(u„ + l)u*u(u n+ l)s n+ 2 

- u(u n+ i)s n+ 2S n +2* (u n +l}u*u(u n+ l)s n+ 2 = 

= US n+ 2 ~ 3u{u n+ i)s n+2 + 3u{u n+ i)s n+2 - u{u n+ i)s n+2 = 

= - u{u n+ i)s n+2 = 

= u[u n + 1 ]s n+2 
1 

= u , 

we used us n+ 2S n+2* u * u $n+2 = us n+2 by applying the induction hypothesis 
to u. 

(4) We prove that for any u, v G [S], uu*vv* = vv*uu* . Let us fix \u\ = and 
let us show the property for any v by induction on \v\. 

So \u\ = and if \v\ = then u,v G S and there is nothing to prove. 

Let us apply induction hypothesis, and for a fixed n suppose the property 
holds for any |u| = and v such that \v\ < n. We then prove the property 
holds for any v' such that \v'\ = n+ 1, in this case we may suppose that 
v' = v[u n+ i]s n+ 2, moreover by definition of [ti n +i] = 1 — Un+i^n+i* an d 
by distribution, 

SS*v[u n+1 }s n+2 S n+2 *[u n +l]v* = SS*V [u n+ \] [u n+ x]s n+2 S n+2 *V* = 

= SS*v[u n+1 ]s n+2 S n+2 *V* = 
= SS*V{1 - {u n+1 ))s n+2 S n+ 2*V* = 
= SS*VS n+2 S n+ 2*V* - SS*v(u n+ i)s n+ 2S n+ 2*V* . 

In the last expression we have \vs n +2\ = n, thus from the basis of the 
induction 

SS*VS n+2 S n+2 V* = VS n+ 2S n+ 2V*SS*. 

In a similar way (it„+i), s n +2 G S and so 

v{u n+ ijS n+ 2S n+ 2*V* = v{u n+ i){u n+ ijS n+ 2S n+ 2*V* = 
= v(u n+1 )s n+2 S n+2 * (U n+ l)v* , 

and \v(u n+ i)s n+2 \ = n, thus 

SS*v(u n+ l)s n+ 2S n+ 2*V* = v(u n +l}s n+2 S n+2 * (u n +l)v* SS* 
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by induction hypothesis. 

Now we complete the proof by an induction on |it|, suppose that the 
property holds for any u such that \u\ < n and for any v G [S] and consider 
u 1 = u[u n+ i]s n+2 , then 

u[u n+1 }s n+2 S n+2 *[u n + l}u*VV* = 

= u[u n+ l][u n+ l}s n+2 S n+2 *U*VV* = 
= u[u n+ l}s n+ 2S n+ 2*U*VV* = 

= US n+ 2S n+ 2*U*VV* - u(u n+ i)s n+ 2S n+ 2*U*VV* = 

= US n+2 S n+ 2*U*VV* - u(u n+ l)s n+2 S n+ 2* (u n+1 )u*VV* , 

since [u n +i]sn+2 G S we have |u[Mn+i] s n+2| = n and by induction hypoth- 
esis we obtain 

VV*US n+2 S n+2 *U* - VV*u(u n+1 )s n+2 S n+ 2* (U n+1 )u* = 

= vv*u[u n+ i]s n+ 2S n+ 2*[u n+1 ]u* . 

□ 

Definition 2.4. Define the bar closure of S, denoted by [S]^, to be the im obtained 
by uj iterations of the complementary closure, that is: let So = S and SVi+i = [S n ] 
then [S]^ = \J n >o S n- 

Proposition 2.5. [S]u, is a bar inverse monoid with [.] defined as above. 

Proof. For every u,v G [S] w there exists n s.t. u,v G [S] n so uv, u* and [u] belong 
to [S] n and so to [S] w ; for the same reason bim's axioms are satisfied. □ 

Definition 2.6. The monoid L* of the Geometry of Interaction is the free monoid 
with a morphism !(.), an involution (.)* and a zero, generated by p, q, and a family 
W = (wi)i of exponential generators such that for any u G L*: 

(9) x*y = S xy for x,y = p,q,Wi, 

(10) \(u) Wi = w^iu), 

where ej is an integer associated with Wi called the lift ofwi, i is called the name 
of Wi and we will often write w i e ^ to explicitly note the lift of the generator. 

Equations w ^ be called of annihilation and fluil are called equations of 
swapping. 

Orienting the equations I|9I10(I from left to right, one gets a rewriting system 
which is terminating and confluent. The non-zero normal forms, known as stable 
forms, are the terms ab* where a and b are positive (i.e. written without *s). The 
fact that all non-zero terms are equal to such an ab* form is referred to as the "ab* 
property" . From this, one easily gets that the word problem is decidable and that 
L* is an inverse monoid. 

Every computation, from now on, will take place in the bar closure of L* in 
Z[L*], which we denote by [L*] w . Since, as said, this is a bim, results in |UR93j . 
which were stated and proved for any bar inverse monoid, apply with no further 
ado. Note that equalities in [L*] w and in Z[L*] are also decidable by rewriting to 
stable form. 

Set [b\, . . . , b n ] = 1 — bibi* — ■ ■ ■ — b n b n *; [b\, . . . , b n ] is an idempotent iff the b^s 
are orthogonal that is, (bi){bj) = 0. 
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Lemma 2.7 ((superposition)). Let a, b and c be positive monomials in L* such 
that (a)(b), {b){c} and (a)(c) ^ 0, then (a)(6)(c) ^ 0. 



Definition 2.8. Let a weight on a directed graph be a functor W from the directed 
graph's involutive category of paths to [L*] w . 

Most of the time, we will simply write <j> for W(</>) to ease the reading of definitions 
and proofs. 

We will say that a coincides with f3 or cquivalently that a and (3 are coincident 
if they have the same target node. 

An edge (3 is called a counter-edge of a along r if f3 ^ a and r is a directed path 
from the a's target to the /3's one, not ending with /3, such that (a)(r*/3) / 0. 

Two coincident counter-edges a and /? are said to be composable (i.e. (a) ((3) ^ 
or equivalently they are reciprocally counter edges along the empty path). 

Definition 2.9. A straight path is a path that contains no sub-path of the form 
4>4>* , i.e. that never bounces back in the same edge. 

A weighted directed graph is said to be split if any three coincident paths of 
length one <p\, <f>2 and 03 are such that (4>i) (fa) (fa) = 0; it is said to be square-free 
if for any straight path cj>, (fxj) = 0. 

Definition 2.10. A weighted directed graph is said to be a virtual net if it is split 
and square-free. 

Splitness can be rephrased as: any three paths <p\, <f>2 and </>3 such that none is 
prefix of another are such that 



2.2. Directed Virtual Reduction. 

Definition 2.11. A directed virtual net R is an acyclic virtual net such that for 
each edge a: 

A. a — [bi, . . . , b n \a, where a, b\, . . . , b n are positive monomials of L*. We will 
denote by a + the weight of a without its filter . . . , b n ] that is, the monomial a. 

B. for any i ^ j and for any two counter-edges Pi, P2 °f along T\, t 2 



Given two coincident counter-edges a and (3, with weights [bi,...,b n ]a and 
[ai, . . . , a m ]b, then DVR originates a new node and two new edges linking that 
node to the sources of a and f3. These new edges have, respectively, weights b' and 
a 1 where a'b'* is the stable form of b*a\ this is shown in Figure 

Note that new edges produced by a step of reduction have positive weights so 
that the resulting computation of the execution formula is more appealing for the 
implementation, as opposed to VR, by the fact that bars are not propagated on 
residuals. 



Proof. See |DPR97| . 



□ 



<0l)(02>(03> = 0. 



(biHbj) =0 

(T 1 *Pt)(T2*(3t)=0 



R {a;b l ;b J ) 

l fl (a;6 i; ri*/3i) 

2 R (a;T 1 *l3 1 ;T 2 *(3 2 ) 
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Figure 4. Composition performed by DVR. 

Definition 2.12. Given two composable edges a and (3, the two edges a' and /?' 
generated by one step of DVR are called residuals of a and (3 respectively. We will 
denote these residuals by 

dvr(a,/3) = (ft, a'). 

Lemma 2.13 ((augmentation)). Let R be a directed virtual net andj2 be a counter- 
edge of 71 along r in R, then 

<7i><t*72)H7i + ><t + *7 2 + >^0 ; 

or equivalently 

7l*r*72 = 7i + *t+*7 2 + ^ 0. 
Proof. See |DPR97j . □ 

In |DPR97| . it has been proved that DVR is sound w.r.t. Girard's execution 
formula: 

Proposition 2.14 ((Invariance)). The execution formula is an invariant of DVR. 

Proof. See |DPR97 |. □ 

2.3. Optimization Rules. In this section we prove two properties in order to 
make DVR more effective, which arc also exploited while developing the implemen- 
tation. These properties are immediate consequences of the orthogonality condi- 
tions satisfied by virtual nets and help to gain effectiveness in the computation and 
to increase the intrinsic parallelism. 

Definition 2.15. Given a directed virtual net R, a total edge a is any edge with 
at most one counter- edge (3 that is, (3 is the only edge such that 

(a)(T* a0 p)^O. 

In this case we say that a is total w.r.t. (3; if a has no counter edge, it is called 
ghost. 

This relation is not symmetric: for any two coincident edges a and (3 such that 
a is total w.r.t. (3, we observe that it is possible that (3 is not total w.r.t. a, in fact: 
suppose f3 = 1, then any 7 coincident with (3 is composable with (3 (thus (3 cannot 
be total w.r.t. a) but a is not composable with 7 otherwise it would contradict the 
splitness condition, thus a is total w.r.t. f3. 

Proposition 2.16. Given two composable edges a and (3 such that dvr(a, (3) = 
(l,a'), then a is total w.r.t. (3. 
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Proof. In order to get a contradiction suppose that there exists a counter-edge 7 
of a along the directed path r. Suppose r is the empty path, therefore 7 coincides 
with a and 0*7 ^ so that dvr(o,7) = (7', a"); in this case we compute ("/)(a)(f3) 
and we have 77*aa* (3(3* — ja"-f'*a' i ' (3* ^ since it is a stable form, and we get a 
contradiction with the splitness condition. 

If t is not the empty path, we can apply the same argument to the residual 7' 
of the reduction sequence along the directed path r, and derive the property by 
lemma 12 . 1 31 applied to a and 7. □ 

Proposition 2.17. Given three coincident edges a, /3, 7 such that 

dvr(o,/3) = (/?', 1) 

with a and 7 composable, then the residual of"/ is not composable with /3' . 

Proof. Suppose 7'*/?' ^ so this product has a stable form, let say then 
(j)(a)(P) = jj*aa*/3i3* = -fa"i*f3'f3* = 70" £'7'* /?* ^ and we get a contradic- 
tion with the splitness condition. □ 

Corollary 2.18. (soundness of the optimization of one) //dvr(o,/?) = (l,o') ; 
then no further composable edge 7 can give 1 as residual of the composition with a. 

Proof. The two residuals in the source of a have weight 1 so that they are compos- 
able and this is a contradiction with the proposition 12. 171 □ 

This corollary allows an optimization rule, in fact the configuration produced 
by the DVR step dvr(o,/3) = (1, a') acts as a compound operator: the edge with 
weight 1 is there just to say that all the coincident edges have to be transferred on 
the source of the edge a', so we propose to transform this configuration by removing 
the edge [3' with weight 1 and using the edge a' for linking the target of /?' and the 
target of a' (see Figure El- 



Figure 5. An optimization rule. 

Now we will prove another property of DVR, which states that when two edges 
a[ and a' 2 are residuals of directed virtual reduction of ct\ and 02 against the same 
edge (3, they are coincident on the source of (3 (evident by definition of a DVR step), 
but are not composable because of splitness: 

Proposition 2.19. Given an edge f3 composable with two coincident edges ct\ and 
a.2 we have that {cx'i){a' 2 ) = 0, where a[ is the residual of ai and a' 2 is the residual 
of a 2 - 

Proof. Suppose (a' 1 }(a' 2 ) 7^ then we have, a^a'^a^a^* an( i so a 'i* a 2 = a 2 a i* 7^ 
0. 

By augmentation lemma we have 

(a 1 ){f3)(a 2 ) = {a+)([3 + }(at), 
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this is 

and by reduction we obtain 



a 1 a 1 */3/3*a 2 a 2 * 



Ql II II* Oil* * 

u\P a 2 a 1 p a 2 

and so this is a non null stable form so that it is different from zero. 



□ 



Property 12.191 allows the implementation of another optimization rule. Specifi- 
cally, we know that every new node v created after a DVR step is the source of only 
two edges, say f3\ and (3 2 , therefore all the edges coincident in v can be separated 
in two sets: the residuals of a DVR step involving (3\ and the residuals of a DVR 
step involving [3 2 . Each edge in a set is orthogonal to any edge in the same set, 
therefore there is no need to perform DVR steps between edges belonging to the 
same set since the composition will actually produce a null result. 

2.4. Translation of Sharing Graphs into Directed Virtual Nets. In order 
to solve the problem of the pairing of duplication operators Gonthier et al. added 
to the sharing graphs a local level structure. Each operator is decorated with an 
integer tag that specifies the level at which it lives. Furthermore in order to manage 
these levels a set of control operators is required. 

More precisely sharing graphs are non-oriented graphs built from the indexed 
nodes represented in Figure These nodes are called sharing operators and dis- 
tinguished in two groups. The first group includes the operators in the original 
Lamping's work: application, and abstraction; the second one is constituted by a 
family of nodes of the same kind (the so called muxes) accordingly to the following 
definition: 





application 



abstraction 



Figure 6. Sharing graph operators. 



Definition 2.20. A node mux or multiplexer is a node with an arbitrary number 
of premises each one having a name n and a lift l n ; like the other nodes, muxes 
have an index of level i. 

The translation of a sharing graph with muxes is defined by induction: 

Definition 2.21. A sharing graph M with root x and context yi, ■ ■ ■ ,y n is trans- 
lated into a directed virtual net in the following way 



yi 




where bullets indicates ports of a sharing graph, 
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If x is a link between two ports with no node 

o 



• If x is a port of an abstraction 

-A. 




• If x is a port of an application 

T. 




• // j/j is a port of a mux 





When the nodes introduced by the translation present the configuration de- 
scribed in the next definition they are reduced by amalgamating edges as in Figure 
□ 

Definition 2.22. A node with n coincident edges a±, .. . , a n and an edge (3 with 
source the target of the on 's is erased and all the on 's are replaced by edges a[ where 
the source of is the source of cti , the target of a[ is the target of (3 and the weight 
of a[ is a>if3. 





Figure 7. Amalgamation of edges. 

With the help of an example we show how to change a A-term into a directed 
virtual net. In Figure|Sl starting from the syntactic graph of the A-term representing 
the Church numeral 2 applied to the identity that is, by using Krivine's notation, 
(Xf\x(f)(f)x)Xxx, we obtain a sharing graph by adding the control operators, 
expressed in the multiplexer syntax, and annotating each node by level indexes. 
Then edges are oriented, unfolded and labeled with monomials in L* in accord to 
the rules expressed by Definition 12.211 and Definition 12.221 Last step consists of 
grouping together arrows going in the same direction, the result of this operation 
is a directed virtual net, see Figure [5] 

Let us check that the obtained net is indeed a directed virtual net: 
• it is obviously a directed graph with no circuits, 
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Figure 8. Representation of a A-term. 




Figure 9. Encoding of a sharing graph into a directed virtual net. 



• squarc-frccness can be proved by induction on the translation of A-tcrms, 

• splitness has to be verified for all the triples of coincident edges; as an 
example we explicitly test the splitness condition for three coincident edges: 

{q){qpw 2)(qq) = qq* qpw 2W 2P* q* qqq* q* = qpwo 2 w 2P* qq* q* = 
because of p*q = 0; the rest of this verification is left to the reader. 

3. Half Combustion Strategy 

In |DPR97j , a strategy called combustion is presented in order to organize DVR 
in such a way that no filter must be kept. This strategy works on full directed 
virtual nets that arc directed virtual nets where each edge is either ghost (see 
Definition 12. 15(1 or has a positive weight. 

Since a ghost edge is an edge for which no more compositions will occur, sources 
of ghost edges never receive residual edges of ghost edges, thus let us define the 
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(out-) valence of a node as the number of non-ghost edges having that node as 
source. 

The combustion strategy of a full net starts from a node v of valence zero (i.e. 
with no future incoming edge or equivalcntly having only ghost outgoing edges) 
and composes all the pairs of coincident counter-edges on v as an atomic action. 
Using the combustion strategy we can give up filters because after the composition 
is performed, all those edges become ghost edges. 

From the point of view of a parallel implementation, the drawback of this strategy 
is that the composition of the coincident counter-edges can be started only when 
a node becomes of valence zero. More specifically, in case many processes are 
used to perform DVR (recall this is desirable anytime we want to fully exploit the 
computing power of parallel or distributed systems equipped with a large number 
of processors), we might incur the risk that, at a given time instant, only a subset 
of those processes host nodes of valence zero. In such a case, all the other processes 
cannot simultaneously proceed with DVR steps (i.e. they need to wait until some 
node they host becomes of valence zero), thus limiting the degree of parallelism 
while performing the reduction. 

We define below the HC strategy that like combustion does not require to keep 
filters and, in addition, allows the composition to be performed even on nodes 
having valence greater than zero, thus allowing high degree of parallelism. HC relies 
on the following notion of semifull directed virtual net which is a generalization of 
the notion of full directed virtual net. 

Let us call semifull directed virtual net a directed virtual net in which each 
edge either is weighted by a positive monomial (i.e. its weight has no filter) or 
all its coincident counter-edges are weighted by a positive monomial (i.e. it can 
be composed exclusively with edges having a positive weight). An example of a 
node in a semifull directed virtual net is shown in Figure 1101 In this example, 
the coincident counter-edges of edges with weight [an, . . . , ayj&i are among those 
edges weighted with 01, a%, a rn . 




Figure 10. A node in a semifull directed virtual net. 

Below we give the definition and provide the soundness of the HC strategy. 

Definition 3.1. Given a composable edge a with positive weight in a semifull 
directed virtual net R, we have to consider two cases: 

(1) if a has no non-positive coincident counter-edge and a positive one f3, then 
the half combustion strategy (HC) performs the composition of (3 with a 
and possibly with every non-positive edge composable with (3; 

(2) if the set {/3i, . . . , /3 n } of non-positive edges composable with a is non-empty 
then HC performs all the possible compositions of a with the fys. 
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Proposition 3.2. If R' is obtained from the directed virtual net R by the HC 

strategy and R is semifull then so is R' . 

Proof. Consider an edge a having positive weight a as in the Definition 13.11 and 
suppose we stay in the second case of the Definition, all the composable edges with 
non positive weights coincident with a are the Pi's with weights [an, . . . , aij t ]bi for 
i = 1, . . . , n as in Figure 1111 



o 




Figure 1 1 . Edges in a semi- full node. 

If we apply a step of the HC strategy by performing a DVR step between a and 
Pi for 1 < i < n, we obtain 

dvr(a,/3j) = 

where the weight of a is [b\, . . . , b n ]a and the weight of Pi is [an . . . a i j i , a]bi and 
the two new edges P' and a' have a positive weight, see Figure IT21 

Therefore, now the set of the coincident filtered edges has been enlarged with a, 
but a is no more composable with the Pi 's because of its filter and all the generated 
edges «i's and pi's have positive weights by definition of DVR. As a consequence, 
all the coincident filtered edges (including a) are not composable with each other. 
Thus the obtained directed virtual net is semifull. 

If we stay in the first case of the definition 13. II performing the composition of P 
with all its non-positive coincident counter-edges we obtain the same configuration 
as in the previous case, moreover we compose a with (the residual of) P and so all 
the non-positive edges incident in the node are not further composable. Note that 
the set of non-positive edges composable with P can possibly be empty, in this case 
HC just composes a and p. 

a 




Figure 12. Edges after the composition performed by HC. 
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We recall that the translation presented in Section l2~4l associates with any A-tcrm 
a full directed virtual net (see also |DR93I lGAL92j ). As full nets are particular 
instances of semifull ones, HC actually represents a reduction mechanism for A- 
calculus. 

Beyond the exploitation of parallelism, another interesting property of HC is 
that we can separate the edges ending on a node in two distinguished sets. In other 
words, the strategy associates a mark with each edge: incoming or combusted. 
When created, edges are marked as incoming. One step of reduction consists of 
picking an incoming edge a and performing all the compositions with coincident 
combusted edges. Then a is marked as combusted. 

Note that an edge may be marked as combusted even when it has a positive 
weight, namely if all the combusted edges coincident with a are not composable 
with a, with the particular case where the set of combusted edges coincident with a 
is empty as in case 1 of Definition 13. II On the other hand, at any step any incoming 
edge has a positive weight. As an edge is marked combusted only after having been 
(successfully or not) composed with every coincident combusted edges, one easily 
sees that two combusted edges are never composable. Thus this suggests that we 
can organize the computation in such a way that the only meaning associated with 
filters is about the belonging of an edge to the first or to the second set (thus, like 
in the combustion strategy, filters can be actually discarded). We have embedded 
this simplification among others in the parallel implementation we present in the 
next section. 

4. The Implementation 

This section is devoted to the description of the implementation of PELCR and 
is organized as follows. We first provide the outline of data structures we have used 
and the high level description of the parallel program. Then we enter details on 
any aspect and/or any optimization characterizing the implementation. Actually, 
the material presented in this section describes the implementation independently 
of the specific language used to develop it (the C language for our case). 

4.1. Data Structures and Code Organization. Each processor i of the archi- 
tecture hosting PELCR runs a process Pi which is an instance of the executable 
code associated with the parallel program. We assume there is a master process, 
that for the sake of clarity will be identified as Pq. All the other processes will 
be referred to as slave processes. Processes communicate exclusively by exchang- 
ing messages and the communication channels among processes are assumed to be 
FIFO (this is not a limitation as the most widely used message passing layers, such 
as PVM or MPI, actually provide the FIFO property to communication channels). 
We call pending message any message already stored in the communication channel, 
which has not yet been received by the recipient process. 

We associate with each node v an identifier, namely id(v). Each edge e = {v\, 1)2) 
is therefore associated with the pair of node identifiers (id(vi) , idfa)) thus the 
weighted edge is represented by the triple (id(vi), id(v2), W(ej). 

As discussed in Section |2~31 by Property 12. 191 any edge e incident on a node v 
can be seen as belonging to one of two distinct sets depending on which between 
the two edges having v as source originated e through composition. We call the 
two sets of edges as LEFT set and RIGHT set, and we associate with each edge 
e an additional information, namely Side(e), indicating whether e belongs to the 
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LEFT or the RIGHT set. This information allows us to reduce the number of edge 
compositions according to the HC strategy which must be performed during the 
computation. Specifically, given two edges e and e' incident on a same node v, if 
the side of the two edges is the same, no composition involving e and e' must be 
performed at all since we a priori know that it will produce null result. On the 
other hand, if Side(e) ^ Side(e'), composition must be performed to determine the 
result, which can be either null or non-null. 

In the general case, each process Pi hosts only a subset of the nodes of the 
graph. Therefore, given an edge e = (v\,V2), there is the possibility that V\ and 
V2 are hosted by distinct processes. In Figure IT^l we show an example of this. The 
interesting point in the example is that when process Pi performs the composition 
between the edges e\ and ei incident on node v according to HC, then a new node, 
namely v' is originated together with two new edges, namely e% and e± incident 
on nodes v\ and vi respectively. The new node v' can be hosted by any process, 
and process Pi is the one which establishes where v' must be actually located; in 
our example, Pi selects Pj. We will come back to the selection issue in Section ET31 
when describing the load balancing module that establishes how new nodes must 
be distributed among processes. Note that, in case one of the newly produced edges 
should have weight one, the optimization of one rule described in Section 12.31 (see 
Figure 0, allows avoiding the real creation of that edge. Also, the only edge really 
created has as source a node already within the directed virtual net, thus no new 
node needs to be created and addressed to some process. 

In our implementation id(v') is a triple [t,Pi,Pj] where Pi is the process that 
created the node v' , Pj is the process hosting that node and t is a time-stamp value 
assigned by Pi. The time-stamp is managed by Pi as follows: it is initialized to 
zero and anytime P, originates a new node, it is increased by one. 

When the new node v' is originated by Pi, the creation must be notified to Pj. 
Furthermore, both P& and Ph must be notified of the new edges e$ and e^ incident, 
respectively, on V\ and z>2- In our implementation we use message exchange only 
for the notification of new edges, while we avoid to explicitly notify the creation of 
the new node v' to Pj . Process Pj will actually create the node v' upon the receipt 
of the first message notifying a new edge incident on v'. We will refer to this type of 
node creation as delayed creation. It allows us to reduce the amount of notification 
messages exchanged among processes. 

Applying the delayed creation technique to the example in Figure 1131 means 
that node v\ is created by P& only upon the receipt of the message carrying the 
information of the edge 63 incident on Vi (recall that this message is sent by Pi). 
Similarly, Ph will create v 2 only upon the receipt of the notification message for the 
edge £4 (also this message is sent by Pj). 

By previous considerations we get that any message exchanged between two 
processes carries the information of a new edge. Specifically, a message carrying the 
information associated with the edge e(v\, 1*2) has a payload consisting of the tuple 
[[t,Pi,Pj], [t',P u P m ],W(e),Side(e)] where id( Vl ) = [t,P h Pj], id(v 2 ) = [t',Pi,P m ], 
W(e) is the weight of e and Side(e) is the edge side. 

Pi keeps track of information related to local nodes in a list nodesi. Any element 
in nodesi has a compound structure. In the remainder of the article we identify 
the structure in nodesi associated with a node v as nodeSi(v). As relevant field of 
the structure nodesi(v) we have a list, namely nodeSiiv) .combusted, containing the 




edges incident on the node v which have already been composed (i.e. the combusted 
edges of the HC strategy). The list nodes i(v). combusted is partitioned into two 
sub-lists, namely nodeSi(v) .combusted. LEFT and nodes^v). combusted. RIGHT, 
containing edges having SideQ equal to LEFT and RIGHT respectively. 

A buffer incomingi associated with Pj is used to store received messages. For 
what we have explained above, any message stored in incomingi carries information 
related to a new edge which must be added to the virtual net and composed with 
already combusted edges, if any, incident on the same node. Such an edge is 
actually an incoming edge of the HC strategy. Therefore, the buffer incomingi 
represents a kind of work list for process Pj, as, according to HC, any incoming 
edge associated with a message stored in incomingi requires Pj to compose it with 
all the already combusted edges incident on the same node. Performing such a 
composition represents the work associated with the message carrying the edge. 

For each process Pj, except the master process Po, both incomingi and nodesi 
are initially empty, meaning that initially there is no node of the directed net 
managed by Pj, nor there are incoming edges for it. Instead, Po is such that its list 
nodeso is empty but its buffer incomings contains a set of messages, one for each 
initial edge of the virtual net (recall that the initial edges are all incoming). Note 
that this does not mean Po is a bottleneck for the parallel execution since the load 
balancing mechanism we have implemented (sec Section l4.3|) promptly distributes 
new edges produced in the early phase of the execution among all the processes. 

In Figure 1141 we show the high level structure of the algorithm implemented by 
the software modules we have developed. Before entering the pseudo code descrip- 
tion, we recall that the HC strategy is such that, any incoming edge of which process 
Pj becomes aware by extracting the corresponding message from incomingi, must 
be immediately composed with the preexisting edges incident on the same node, 
without additional delay. Furthermore, given a message m carrying the information 
of a new edge e = (vi,V2), we denote as m.target the node identified by the in- 
formation id(v2) carried by m (recall that id{v2) is the previously described triple) 
and as m. source the node identified by the information id{v\) carried on the same 
message, e.target and e. source have similar meaning when referring to an edge e. 
Also, we denote as e m the edge carried by m. 
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program Pi ; 

1 initialize{)\ 

2 while not end-computation do 



3 (collect all incoming messages and store them in incomingi) 

4 while not empty(incomingi) do 

5 (extract a message m from incomingi); 

6 if m.target E nodesi 'node already in the local list' 

7 then 

8 for each edge e G nodeSi{m.target) .combusted do 

9 if Side(e m ) ji Side(e) 

10 then 

11 (compose e m with e); 

12 (select the destination process Pj for hosting the node possibly 
originated by the composition); 

13 (send the edges produced by the composition to Pk and Ph 

hosting m. source and e. source respectively) 

14 endfor 

15 else (add m.target to nodeSi)\ 'delayed creation' 

16 (add e m to nodesi(m.target). combusted. Side(e rn )) 

17 endwhile; 

18 {end-computation — check_termination()); 



19 endwhile 



Figure 14. Pseudo code for process Pi. 



The procedure initialize^ sets the initial values for all the data structures. The 
procedure emptyQ checks whether the buffer storing received messages is empty. 
In the positive case, process Pi has no work to be performed, thus it invokes the 
procedure check termination^) to check if the computation is actually ended, i.e. 
no message will arrive (in Section l4~4l we will provide details on how the detection 
of the termination is implemented). In the negative case, it extracts a message from 
the buffer incomingi and performs the composition of the corresponding incoming 
edge. 

By the test in line 11 we exploit information about Side(e m ) to avoid unnecessary 
edge compositions. The pseudo code structure also points out that process P. \ checks 
for the presence of pending messages only when incomingi is empty (i.e. when Pj 
has no more work to be performed unless new pending messages carry it). This 
behavior aims at reducing the communication overhead. Specifically, a procedure 
to check whether there are pending messages is realized typically by using probe 
functions supported by the used communication layer. Pi invokes the execution of a 
probe function to test if there is at least a pending message. If there is at least one 
such message, then a recv procedure is executed to receive that message and store it 
into incomingi. As pointed out in other contexts DNRD96 , probe functions may 
be expensive, therefore, they should be executed only when a further delay could 
actually produce negative effects on performance. In the general case, delaying 
the probe call until all the messages stored in incomingi have been processed 
should not produce negative effects. This is the reason why, in the general case, we 
suggest to perform the probe call only when incomingi becomes empty. However, 
we noted that depending on the particular hardware/software architecture and on 
the adopted message passing layer, excessive delays in receiving pending messages 
could impact negatively on the performance of the communication layer due to 
buffer saturation. This is the case we have observed for our implementation based 
on MPI. For this reason we have done a light modification to the general code 
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structure in Figure 1141 in order to avoid excessively infrequent probe calls (and 
message receipts). 

Beyond the overhead due to probe calls, another important issue is the overhead 
related to send and receive operations. A solution to bound this overhead will be 
discussed in the following subsection. Then we will present the policy we have 
selected for balancing the load among processes and other relevant aspects related 
to the implementation. 

4.2. Message Aggregation. The cost of sending and receiving a physical mes- 
sage, paid part by the sender and part by the receiver, can be divided into two 
components: (i) an overhead that is independent of the message size, namely oh, 
and (ii) a cost that varies with the size of the message, namely s x ohb where s is 
the size (in bytes) of the message and ohb is the send/receive time per byte, oh 
typically includes the context switch to the kernel, buffer reservation time, the time 
to pack/unpack the message and, in case of distributed memory systems, the time 
to setup the physical network path. Instead, ohb takes into account any cost that 
scales with the size of the message. 

oh is usually higher than ohb, as shown in |XH96| up to two orders of magnitude, 
therefore it results usually more efficient to deliver several information units (i.e. 
more than one application message) with a single physical message, in such a way 
that a single pair of send/receive operations is sufficient to download many data 
at the recipient process. This allows the reduction of the static overhead oh for 
each information unit, thus originating efficient parallel executions, especially in 
the case of fine grain computations like DVR. As an example, if three application 
messages of size s constitute the payload of a single physical message then the cost 
to send and receive these application messages is reduced from 3o/i + 3s x ohb to 
oh + 3s x ohb- 

We present below the optimization we have embedded in the communication 
modules via the aggregation of application messages in a single physical message. 
Each process Pi collects application messages destined to the same remote process 
Pj into an aggregation buffer outJmff^j. Therefore, there is an aggregation buffer 
associated with each remote process. Application messages are aggregated and are 
infrequently sent via a single physical message. The higher the number of appli- 
cation messages aggregated, the greater the reduction of the static communication 
cost per application message; we call this positive effect Aggregation Gain (AG). 
However, the previous simple model for the communication cost ignores the effects 
of delaying application messages on the recipient process. More precisely, there 
exists the risk that the delay produces idle times on the remote processes which 
have already ended their work and are therefore waiting for messages carrying new 
work to be performed; we call this negative effect Aggregation Loss (AL). Previous 
observations outline that establishing a suited value for the aggregation window 
(defined as the number of application messages sent via the same physical message) 
is not a simple task. 

In our implementation, the module controlling the aggregation keeps an age 
estimate for each aggregation buffer outJjuffij by periodically incrementing a local 
counter c 8J . The value of a t j is initialized to zero and is set to zero each time 
the application messages aggregated in the buffer are sent. At the end of the 
composition phase of an incoming edge extracted from the local work list incomingi, 
Cij is increased by one if at least one message is currently stored in the aggregation 
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buffer outJbuffi j. Therefore, one tick of the age counter is equal to the average 
combustion time of an incoming edge and the counter value represents the age of 
the oldest message stored in the aggregation buffer. 

The simplest way to use previous counters is to send the aggregate when the 
associated counter reaches a fixed value, referred to as maximum age for the ag- 
gregate, or when the work list of the process is empty. In this case there is no 
need to delay the aggregate anymore as the probability to put more application 
messages into it in short time is quite small, so the delay will not increase AG and 
will possibly produce an increase of AL. We will refer to this policy as Fixed Age 
Based (FAB). Although this policy is simple to implement and does not require any 
monitoring for the tuning of the maximum age over which the aggregate must be 
sent, it may result ineffective whenever a bad selection of the maximum age value 
is performed. 

To overcome this problem we have implemented a Variable Age Based (VAB) 
policy, which is an extension of FAB, having similarities with an aggregation tech- 
nique presented in |GAGRW98| for communication modules supporting fine grain 
parallel discrete event simulations. In VAB, anytime the messages aggregated in 
outJbuffij are sent, the message rate achieved by the aggregate is calculated. This 
rate is used to determine what the maximum age for the next aggregate should be. 
The dynamic change of the maximum age after which an aggregate must be sent, 
allows the aggregation policy to adapt its behavior to the behavior of the overlying 
application. 

To implement VAB, Pj is required to maintain an estimate estij of the expected 
arrival rate in each aggregation buffer outJbuffij (the higher such rate, the higher 
AG for that buffer). This estimate can be computed by using statistics related to 
a temporal window. If the arrival rate for the current aggregate in outJbuffij is 
higher than estij then the maximum age for the next aggregate into that buffer is 
increased by one since the application is likely to start a period of bursty exchange 
of application messages from Pj to Pj . Therefore a slight increase in the maximum 
age is likely to relevantly increase AG. If the arrival rate falls below estij, then 
the maximum age is decreased by one (provided it is greater than one). An upper 
limit on the maximum age can be imposed in order to avoid negative effects due to 
AL (i.e. in order to avoid excessive delay for the delivery of the aggregate at the 
recipient process). 

4.3. Load Balancing. Whenever the composition between two edges is performed 
by a process Pi then a new node is originated and Pi must select a process Pj 
(possibly j — i) which will host the new node. In order to provide good balance 
of the load we have implemented a selection strategy for the destination process 
which uses approximated state information related to the load condition on each 
process. 

In our solution we identify the number of unprocessed application messages upm 
stored in the buffer incoming,; as the state information related to the load condition 
on Pj . Pi keeps track of the values of upm related to itself and to the other processes 
into a vector U PMj of size n (where n is the number of processes). U PMi[i] records 
the current value of the number of unprocessed application messages of Pi . UP Mi [j] 
records the value of the number of unprocessed application messages of Pj known 
by Pi. These values are spread as follows. Whenever Pj sends a physical message 
M to Pj, the value of UPMi[i] is piggy-backed on the message, denoted M.UPM 
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Whenever a physical message M sent by Pj to Pi is received from Pi, then 
UPM^j] is updated from M.UPM (i.e. UPMi[j] <- M.UPM). The information 
on the load conditions kept by the UPM vectors is approximated for two reasons: 

• there exists the possibility that when Pj receives M from Pj the current 
value of UPM [j] is different from M.UPM; 

• the current value of UPMi[i) is not an exact representation of the current 
load of Pi as it does not count application messages carried by pending 
physical messages; these application messages represent work to be per- 
formed which has not yet been incorporated into the buffer incomingi. 

We note, however, that obtaining more accurate state information on the load 
condition of a process would require the exchange of additional physical messages 
or, at worst, a synchronization among processes which could produce unacceptable 
negative effects on the performance. Anyway, it is important to remark that the 
FIFO property for the communication channels guarantees that each time a physical 
message M sent by Pj is received from Pj, the piggy-backed value M.UPM refers 
to a more recent load condition as compared to the one indicated by the current 
value of UPMi[j}. 

Based on the values stored in UP Mi, we have implemented a selection policy for 
the destination process of a new node which is a modified round-robin. It works as 
follows. Pi keeps a counter rrj initialized to zero which is updated (module n) each 
time a new node is produced by Pj . The current value of rr j is the identifier of the 
process which should host the new node according to the round-robin policy. Pj 
actually selects P rri as destination if f7PMj[rr,] < UPMi[i]; otherwise Pj selects 
itself as destination for the new node. In other words, each process distributes the 
load in round-robin fashion unless, at the time the load distribution decision must 
be taken, the local load is lower than that of the remote process which should be 
selected. 

4.4. Termination Detection. The implementation of the termination detection 
relies on the use of additional control messages. Specifically, anytime a process dis- 
tinct from the master Pq has no more work to be performed, it sends to Pq a message 
carrying information about both the number of application messages received from 
other processes, which have already been processed and the number of application 
messages produced for other processes. Such a message will be referred to as status 
message. When the master P detects that each process has already elaborated all 
the application messages produced for it, Pq discovers that the computation is over 
and notifies the termination to the slave processes. This is done through the send 
of a termination message. By looking at the structure of the code in Figure PHI 
it can be seen that process Pj executes the check -termination^) procedure only 
when no work to be performed has been detected (i.e. when incomingi is empty). 
This points out that no synchronization is required (i.e. Pi sends its status mes- 
sage without blocking to receive an acknowledgment; it will possibly receive the 
termination message during a future execution of the check J,ermination() proce- 
dure) ( 2 ). On the other hand, the master Pq checks for incoming status messages 



We use "M" to denote a physical message in order to distinguish it from an application 
message previously denoted as "m" . 

2 Actually, to keep low the overhead due to status messages, Pj sends one such message to the 
master only if its status has changed since the last status message was sent. 
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and possibly sends the termination messages only when it has no more work to be 
performed. Note that the consistency of the information collected by Pq through 
status messages is guaranteed by the FIFO property of communication channels. 

4.5. On-the-Fly Storage Recovery. At the end of the computation we get that 
for all the nodes of the final graph the incident edges are ghost. However only some 
of those nodes belong to the normal form of the reduction. We recall that the nodes 
of the initial graph belonging to the normal form are the nodes initially having only 
ghost incident edges; this set of nodes will be referred to as the border. Starting 
from the border, we can determine the whole normal form: it contains those nodes 
linked to the border by a directed path. 

We have embedded in the implementation a technique to discard on-the-fly nodes 
that do not belong to the normal form. We have taken this design choice for keeping 
low memory usage with the twofold aim of (i) increasing the efficiency of the un- 
derlying virtual memory system, and (ii) allowing efficient management of the data 
structures maintained at the application level. As respect to the latter issue, any- 
time an unprocessed application message m is extracted from the buffer incomingi 
(see line 5 of the pseudo code in Figure IHf) . process Pi must access information 
associated with the node m.target, if it already exists. Such an information is 
maintained in the structure nodeSi(m.target) (see lines 6 on of the pseudo code in 
Figure [EJ- To retrieve the virtual memory address for this structure we have used 
an hash table with chaining for handling collisions, which keeps an active entry for 
each node in the list nodesi . Discarding nodes of valence zero that do not belong 
to the normal form allows keeping low the number of entries of the hash table, thus 
allowing efficient access to the table at anytime. 

The on-the-fly storage recovery technique we have implemented tracks whenever 
a node becomes of valence zero and removes it if there is no directed path towards 
nodes of the border. This is implemented through a particular type of application 
messages we call EOT (End-of- Transmission) messages. Specifically, for each initial 
node v that does not belong to the border we insert an EOT message. If we ensure 
that EOT messages are processed only after all the messages carrying edges destined 
to v have been already processed, then we detect upon processing of the EOT mes- 
sages that no new edge will have v as its target. This means that node v, and all the 
edges pointing to it, can be deleted. Before removing this node, the EOT message 
is propagated to the sources of the edges pointing to v. We note that, since each 
node is source of two edges, we expect the arrival of two EOT messages (one from 
both sides) before handling the removal of the node. Therefore, if for a node v we 
have no EOT message destined to it, or at most only one of such messages, it means 
that v has a directed path to the border, thus it belongs to the normal form. In 
this case no removal takes place. 

Actually, guarantees that the EOT messages will be processed only after all the 
messages carrying incoming edges destined to the same node have already been 
processed, is trivially achieved thanks to the FIFO property of communication 
channels. 

5. Experimental Results 

In this section we report experimental results demonstrating the effectiveness 
of our implementation, and thus of both the HC strategy underlying it and the 
combination of all the optimizations for the run-time behavior we have presented 
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and embedded within PELCR. As already pointed out, the implementation has 
been developed using the C language and MPI as the underlying message passing 
layer. A major advantage of using such a standard interface for message passing 
functionalities is that it makes the software highly portable. This allowed us to test 
the implementation on a wide set of platforms such as SMP machines with Linux, 
IBM mainframes like SuperPowcr 3 and SupcrPower 4 with AIX, shared memory 
Sun Ultra Sparc machines and Alpha Digital MicroChannel clusters. 

In this section we report performance results obtained in the case of an IBM 
server pSeries 690, with 32 Power4 CPUs - 1.3 GHz, 64 GB RAM, running IBM 
AIX 5.2 ML1+ as the operating system. We have selected the results obtained with 
this machine as representative especially because of the larger number of available 
processors, as compared to the other architectures. This allows us to better observe 
whether and how the performance provided by PELCR scales while increasing the 
computing power. 

Before entering details related to the experimental results, we note that there 
exists an approach to reduce the computation time in optimal reduction systems 
based on an optimization known as safe operators |AC97) . It allows the merging of 
many control operators in a compound one acting as the sequence, thus exhibiting 
the ability to strongly decrease the number of interactions. Actually, this approach 
has been the basis for the implementation of the so called BOHM (Bologna Higher 
Order Machine) , which is a sequential machine for optimal reduction that has been 
demonstrated to provide better performances compared to non-optimal interpreters 
such as CAML and HASKELL (sec AG9B)). Compared to this approach, we tackle 
the issue of increasing the speed of the reduction in an orthogonal way. Specifically, 
we do not use merging of operators to reduce the number of interactions, instead 
we exploit computing capabilities of multiple processors within the architecture 
to keep low the reduction time via parallel computation of the reduction itself. 
This approach can be applied to any computation issued from the Geometry of 
Interaction. Note that we experimentally observe speedup in parallel evaluation 
of terms typable in systems with intrinsic complexity ELL (or LLL) |Gir95| : these 
terms are evaluated in such a way that the safe operators optimization is not needed. 
As a consequence, our approach is expected to speedup the execution even for the 
cases in which safe operators cannot be effectively employed, or even when optimal 
reduction does not affect the efficiency of computation like in non-higher order 
terms. 

The results we report in this section refer to the following two different test cases: 

• DD4, which corresponds to the A-term (<5)(<5)4 where 6 — Xx(x)x represents 
the self application. The normal form of this term represents the Church's 

4 4 

integer (4 ) . 

• EXP3, which corresponds to the ELL term 

Ite((Mult)2, 1, Ite((Mult)2, 1, Ite((Mult)2, 1, 4))), 

whose normal form represents the iterated exponential 2 . For a precise 
relation between this ELL term and the multiplicative linear logic proof 
net from which the dynamic graph to be executed is obtained, we refer the 
reader to |Ped96| . 

For both these two cases, the shared result of the HC strategy has a number of 
nodes which is on the order of hundreds of thousands (for DD4 this number even 
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reaches about one million and half), therefore they are large enough case studies 
to stress the behavior of our implementation. 

Before presenting the results, we provide details on the main parameters we have 
measured in the experiments. 

5.1. Measured Parameters. A measure of success of any parallel implementa- 
tion is how significantly it accelerates the computation. Typically the acceleration 
is expressed by the so called speedup, evaluated as the ratio between the sequential 
execution time on a single processor and the parallel execution time on multi- 
ple processors. Actually, this is a fundamental parameter to consider, not only 
because it expresses the amount of increase in the execution speed while increas- 
ing the power of the underlying computing system, but also because the speedup 
curve provides indications on how the execution speed scales while increasing the 
computing power. Linear speedup means that the execution speed scales linearly 
vs the computing power. This is an indication that the parallel implementation 
maintains the same effectiveness independently of the number of used processors, 
thus the implementation itself docs not suffer, e.g., from excessive increase in the 
communication overhead while increasing the number of processes involved in the 
parallel execution. Actually, we also report the ratio between the observed speedup 
and the ideal speedup that can be achieved with a given degree of parallelism, i.e. 
with a given amount of used processors. (We recall that the ideal speedup on n 
processors is equal to n, which means we experience no overhead but only gain 
by distributing the work to be performed on the n processors.) This parameter 
provides indications on the extent to which the parallel implementation can be 
considered effective, independently of the shape of the speedup curve. Specifically, 
if we have a linear speedup curve but a low ratio over the ideal speedup, it means 
that the parallel implementation, although not particularly suffering from increase 
in the overhead due to the parallelization while increasing the amount of processors, 
is anyway ineffective, e.g. due to inadequate structuring of the parallel algorithm 
it implements. 

Beyond speedup, another parameter we report is the wall-clock time for the 
reduction. This parameter expresses the real time cost for a given reduction and 
also how it varies while increasing the computing power of the underlying platform. 
It is a fundamental parameter to report since it provides indications on whether the 
speedup curve has been evaluated over a representative interval for what concerns 
the number of used processors. Specifically, if wall-clock time values of few seconds 
or less are achieved while increasing the number of processors, then it means that 
an additional increase in the computing power does not make sense for this specific 
reduction (this is because response time of few seconds or less is typically considered 
satisfactory even for the case of an interactive end-user, i.e. the case in which 
responsiveness is actually a critical issue to address), hence the speedup has been 
evaluated over an adequate interval for what concerns the degree of parallelism. 

The wall-clock time and the speedup are parameters that express the effective- 
ness of the parallel implementation when evaluated globally. However, we are also 
interested in observing the effects of specific optimizations wc have proposed. As 
respect to this point, we also report data that allow the evaluation of the benefits 
from the VAB message aggregation technique discussed in Section 14.21 and the ef- 
fectiveness of the load balancing policy presented in Section 14.31 To evaluate how 
VAB impacts the communication cost while increasing the amount of processors, 
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Figure 15. Speedup and Wall-Clock Time results for DD4. 

we report the product between the average number of application messages deliv- 
ered through a single MPI message, i.e. the average size of the aggregate which 
we will refer to as AAS, and the number of used processors. This product is rep- 
resentative of the system capacity to send application messages at the time cost of 
sending a single MPI message. Specifically, when using n processors, the hosted 
processes can perform send operations of MPI messages concurrently. Therefore, 
within the wall-clock time of a single send operation, we are, on the average, able 
to send n MPI messages in parallel. As a consequence, if the product between AAS 
and the number of processors increases, we have that the time cost for the send 
of each application message gets reduced, with consequent reduction of the overall 
communication overhead on each processor. (For completeness, we also report the 
plot for AAS, so as to show its behavior while increasing the number of proces- 
sors.) For what concerns load distribution, we report plots related to the variation, 
over time, of the amount of unprocessed application messages, namely upm, stored 
in the incoming buffer at different processes (recall that upm has been used in 
Section 14.31 as the information on current load on each processor to determine the 
distribution of new nodes dynamically originated during the computation). This 
parameter is representative of the effectiveness of the load balancing strategy we 
have adopted since it provides indications on whether the work list keeping track of 
the amount of edges to be composed is approximatively the same on all processes 
at any time during the execution. 

5.2. Results for DD4. The experimental measures obtained for DD4 are reported 
in Figures El El an d El By the plots in Figure El we observe that the speedup 
curve remains linear over the whole interval for what concerns the number of user 
processors (i.e. up to 32); also the speedup value is constantly on the order of the 




Figure 16. VAB message aggregation results for DD4. 



70% of the ideal speedup. Combined toghether, these two plots indicate that the 
parallel implementation is effective for what concerns both the structuring of the 
parallel algorithm (this provides the ability to reach high values with respect to the 
ideal speedup) and the capability of remaining performance efficient while increasing 
the amount of used processors. Also, the wall-clock time curve, always reported 
in Figure 1151 demonstrates that the speedup plots provide reliable performance 
indications, in the sense that speedup has been evaluated over an adequate interval 
for what concerns the amount of used processors. Specifically, with 32 processors, 
the wall-clock time for the computation gets on the order of 2.2 seconds, which is 
not only a definitely reduced value as compared to the sequential execution time 
(i.e. the execution time on a single processor, namely about 50 seconds), but also 
represents a satisfactory response time for an interactive end-user. 




Figure 17. Variation of upm over time for DD4 (single processor 
case on the left - four processors case on the right). 



The plots related to the behavior of the VAB aggregation technique in Figure [TBI 
and to the variation of upm over time in Figure flTI additionally help understanding 
the reason why the implementation remains effective while increasing the amount 
of used processors. The AAS curve in Figure El shows that the average amount of 
application messages aggregated within each MPI message gets reduced from about 
11.5 to about 3 while moving from the single processor execution to the execution 
on 32 processors. This is an expected behavior when thinking that a larger amount 
of used processors means that each process Pi involved in the parallel execution 
needs to manage an increased amount of channels towards other processes. As a 
consequence the application messages produced by Pi must be distributed over a 
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larger amount of aggregation buffers emtJmffij, which means a reduced capacity 
to aggregate messages within a given time unit in each single buffer. However, 
observing the curve related to AAS multiplied by the number of used processors, 
always in Figure lTBl we have a clear indication that the system capacity to send ap- 
plication messages at a given cost linearly increases with the number of processors, 
with a slope of about 0.8 (recall the ideal case for the curve of AAS multiplied by 
the amount of processors would be for slope equal to 1). Specifically, when mov- 
ing from k processors to 2k processors, the system increases its capacity of sending 
application messages at the same time cost of about 1.6 times, which is a clear indi- 
cation that VAB allows the communication overhead to scale well vs the size of the 
underlying computing system. For what concerns the variation of upm over time, 
in Figure IT7I we report both the case of single processor execution and the case of 
execution on four processors. By the plots we observe that the load of unprocessed 
application messages, stored by each process Pi within the incomingi buffer, is well 
distributed on each of the four processors during the whole execution period, thus 
supporting the claim of the effectiveness of the load balancing mechanism described 
in Section IP 

5.3. Results for EXP3. The data obtained for the EXP3 application allow, in the 
light of the already observed results for DD4, the determination of additional infor- 
mation on the run time behavior of our implementation, also in terms of the effects 
of the particular application on the achievable performance. The main difference 
with respect to the case of DD4, is in that this time the speedup curve does not 
remain linear vs the number of processors. Specifically, the plots in Figure IT51 show 
that the speedup asymptotically tends to a constant value (on the order of 12), 
with a consequent decrease of the ratio over the ideal speedup. By the data related 
to the effects of VAB and to the variation of upm over time, it can be deduced that 
the cause for such a behavior is not due to ineffectiveness of the parallel implemen- 
tation (e.g. in terms of increase of the communication overhead while increasing 
the amount of processors or load imbalance). Specifically, the curve in Figure IT§1 
related to AAS multiplied by the number of used processors clearly shows that 
also in this case the implementation is able to carefully control the communication 
overhead while the size of the underlying computing system gets increased. More 
precisely, the linearity of such a curve, with slope on the order of about 1, i.e. the 
ideal slope value, provides indication that the implementation is able to control the 
communication overhead even in a more effective way that what done for the case 
of DD4 in the previous section (recall the slope for the same curve for DD4 was 0.8, 
thus lower than what we observe in this case). Also, the plots for upm in Figure 
1201 show that load remains balanced while moving from the single processor case to 
the execution on multiple processors. 

Actually, that type of behavior for the speedup curve is due to the fact that 
EXP3 exhibits a final computation phase made of a very limited amount (i.e. few 
units) of unprocessed application messages. This can be clearly observed when 
looking at the plots in Figure 1201 related to the variation of upm over time. In 
other words, during that final phase the computation becomes intrinsic sequential 
(few unprocessed application messages produce, once extracted from the incoming 
buffer and composed through HC, few new application messages carrying new edges 
for the virtual net). As a consequence, during the final phase, parallelism cannot be 
exploited for EXP3, which is exactly the reason why speedup asymptotically tends 
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Figure 18. Speedup and Wall-Clock Time results for EXP3. 

to a constant value. This type of problem has been already observed by Mackie 
in |Mac97| for the case of parallelism in the form of adequate assignment of the 
initial nodes of the interaction net over the used processors (recall this solution 
is based on a static analysis of the initial interaction nets and differs form our 
proposal in that we dynamically control load distribution, and other performance 
indexes at run time). Specifically, also for Mackie's approach, the benefits from the 
exploitation of multiple processors in the computing system are bounded by phases 
of sequential computation, if any, intrinsic to the specific application. However, 
as an additional support to the fact that our implementation is definitely able to 
exploit parallelism, whenever present within specific execution phases, we note that 
in case the speedup results were computed by excluding the final sequential phase 
(by the plots in FigureEUlsuch a phase lasts about three seconds, which is the reason 
why the wall-clock time asymptotically tends to three seconds while increasing the 
number of used processors), they would be even better that those obtained for the 
case of DD4 in the previous section. Specifically, speedup would be on the order of 
at least 75% of the ideal over the whole interval for the amount of used processors. 

6. Summary 

The definition of a formal system in [DR93I IDPR97| for the computation of the 
execution formula |Gir89| of the A-calculus terms constitutes the original starting 
point for the work we have presented in this article. Such a formal system was 
initially motivated as the mathematical settlement of an operational semantics for 
A-calculus and for other functional programming languages. In making the precise 
definition of this system, properties of locality and asynchrony pointed out the 
potential for distributing the execution of programs. 
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Figure 19. VAB message aggregation results for EXP3. 
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Figure 20. Variation of upm over time for EXP3 (single processor 
case on the left - four processors case on the right). 

The main contribution of this work relics on showing how it is possible to make 
a functional language, based on A-calculus, transparently executable on a paral- 
lel/distributed environment. This result has been achieved by exploiting the de- 
composition of beta-reduction into a set of more elementary execution steps, each 
one independent of the others, which gives an execution model extremely flexible 
and very prone to be supported by a parallel/distributed environment. Specifically, 
we exploited the properties of locality and asynchrony of directed virtual reduc- 
tion, namely the formal system providing the previously mentioned decomposition 
to develop the PELCR software package. This package manages the distribution of 
computational load due to the evaluation of a A-term in a totally transparent way, 
by dynamically controlling/tuning any parameter potentially affecting the efficiency 
of the run-time behavior. 

Our presentation integrates a solid theoretical background with many practical 
techniques coming from current parallel computing developments and provides a 
full featuring facility for the parallel/distributed execution of functional programs. 
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